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Preface to "A Wavelet tour of Signal Processing" 


Note: Additional figures, numerical and programming tools as well as 
exercises for this book may be found at http://www.wavelet-tour.com/ . 


Preface to the Sparse Edition 


I can not help but find striking resemblances between scientific 
communities and schools of fish. We interact in conferences and through 
articles, we move together while a global trajectory emerges from 
individual contributions. Some of us like to be at the center of the school, 
others prefer to wander around, and few swim in multiple directions in 
front. To avoid dying by starvation in a progressively narrower and 
specialized domain, a scientific community needs to move on. 
Computational harmonic analysis is still well alive because it went beyond 
wavelets. Writing such a book is about decoding the trajectory of the 
school, and gathering the pearls that have been uncovered on the way. 
Wavelets are not any more the central topic, despite the original title. It is 
just an important tool, as the Fourier transform is. Sparse representation and 
processing are now at the core. 


In the 80's, many researchers were focused on building time-frequency 
decompositions, trying to avoid the uncertainty barrier, and hoping to 
discover the ultimate representation. Along the way came the construction 
of wavelet orthogonal bases, which opened new perspectives through 
collaborations with physicists and mathematicians. Designing orthogonal 
bases with Xlets became a popular sport, with compression and noise 
reduction applications. Connections with approximations and sparsity also 
became more apparent. The search for sparsity has taken over, leading to 
new grounds, where orthonormal bases are replaced by redundant 
dictionaries of waveforms. Geometry is now also becoming more apparent 
through sparse approximation supports in dictionaries. 


During these last 7 years, I also encountered the industrial world. With a lot 
of naiveness, some bandlets and more mathematics, we created a start-up 
with Christophe Bernard, Jérome Kalifa and Erwan Le Pennec. It took us 
some time to learn that in 3 months good engineering should produce robust 
algorithms that operate in real time, as opposed to the 3 years we were used 
to have for writing new ideas with promissing perspectives. Yet, we 
survived because mathematics is a major source of industrial innovations 
for signal processing. Semi-conductor technology offers amazing 
computational power and flexibility. However, ad-hoc algorithms often do 
not scale easily and mathematics accelerates the trial and error development 
process. Sparsity decreases computations, memory and data 
communications. Although it brings beauty, mathematical understanding is 
not a luxury. It is required by increasingly sophisticated information 
processing devices. 


New Additions 


Putting sparsity at the center of the book implied rewriting many parts and 
adding sections. Chapter 12 and Chapter 13 are new. They introduce sparse 
representations in redundant dictionaries, and inverse problems, super- 
resolution and compressive sensing. Here is a small catalogue of new 
elements in this third edition. 


Radon transform and tomography. 

Lifting for wavelets on surfaces, bounded domains and fast computations. 
JPEG-2000 image compression. 

Block thresholding for denoising. 


Geometric representations with adaptive triangulations, curvelets and 
bandlets. 


Sparse approximations in redundant dictionaries with pursuits algorithms. 


Noise reduction with model selection, in redundant dictionaries. 


Exact recovery of sparse approximation supports in dictionaries. 
Multichannel signal representations and processing. 

Dictionary learning. 

Inverse problems and super-resolution. 

Compressive sensing. 


Source separation. 


Teaching 


This book is intended as a graduate textbook. Its evolution is also the result 
of teaching courses in electrical engineering and applied mathematics. A 
new web site provides softwares for reproducible experimentations, 
exercise solutions, together with teaching material such as slides with 
figures, and Matlab softwares for numerical classes: http://wavelet- 
tour.com. 


More exercises have been added at the end of each chapter, ordered by level 
of difficulty. Level’ exercises are direct applications of the course. Level? 
requires more thinking. Level? includes some technical derivations. Level* 
are projects at the interface of research, that are possible topics for a final 
course project or an independent study. More exercises and projects can be 
found in the web site. 


Sparse Course Programs 


The Fourier transform and analog to digital conversion through linear 
sampling approximations provide a common ground for all courses 
(Chapters 2 and 3). It introduces basic signal representations, and reviews 
important mathematical and algorithmic tools needed afterwards. Many 
trajectories are then possible to explore and teach sparse signal processing. 


The following list gives several topics that can orient the course structure, 
with elements that can be covered along the way. 


Sparse representations with bases and applications 
Principles of linear and non-linear approximations in bases (Chapter 9). 
Lipschitz regularity and wavelet coefficients decay (Chapter 6). 
Wavelet bases (Chapter 7). 


Properties of linear and non-linear wavelet basis approximations (Chapter 
9). 


Image wavelet compression (Chapter 10). 
Linear and non-linear diagonal denoising (Chapter 11). 
Sparse time-frequency representations 


Time-frequency wavelet and windowed Fourier ridges for audio 
processing (Chapter 4). 


Local cosine bases (Chapter 8). 

Linear and non-linear approximations in bases (Chapter 9). 
Audio compression (Chapter 10). 

Audio denoising and block thresholding (Chapter 11). 


Compression and denoising in redundant time-frequency dictionaries, 
with best bases or pursuit algorithms (Chapter 12). 


Sparse signal estimation 


Bayes versus minimax, and linear versus non-linear estimations (Chapter 
IL): 


Wavelet bases (Chapter 7). 
Linear and non-linear approximations in bases (Chapter 9). 
Thresholding estimation (Chapter 11). 
Minimax optimality (Chapter 11). 
Model selection for denoising in redundant dictionaries (Chapter 12). 
Compressive sensing (Chapter 13). 

Sparse compression and information theory 
Wavelet orthonormal bases (Chapter 7). 
Linear and non-linear approximations in bases (Chapter 9). 
Compression and sparse transform codes in bases (Chapter 10). 
Compression in redundant dictionaries (Chapter 12). 
Compressive sensing (Chapter 13). 
Source separation (Chapter 13). 

Dictionary representations and inverse problems 
Frames and Riesz bases (Chapter 5). 
Linear and non-linear approximations in bases (Chapter 9). 
Ideal redundant dictionary approximations (Chapter 12). 
Pursuit algorithms and dictionary incoherence (Chapter 12). 
Linear and thresholding inverse estimators (Chapter 13). 


Super-resolution and source separation (Chapter 13). 


Compressive sensing (Chapter 13). 
Geometric sparse processing 
Time-frequency spectral lines and ridges (Chapter 4). 
Frames and Riesz bases (Chapter 5). 
Multiscale edge representations with wavelet maxima (Chapter 6). 
Sparse approximation supports in bases (Chapter 9). 


Approximations with geometric regularity, curvelets and bandlets 
(Chapters 9 and 12). 


Sparse signal compression and geometric bit budget (Chapters 10 and 12). 
Exact recovery of sparse approximation supports (Chapter 12). 


Super-resolution (Chapter 13). 
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Introduction to "A Wavelet tour of Signal Processing" 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
Processing, The Sparse Way (third edition, 2009) by Stéphane Mallat. The 
book's website at Academic Press is 
http://www.elsevier.com/wps/find/bookdescription.cws_home/714561/descr 
iption#description The book's complementary materials are available at 
http://wavelet-tour.com 


Sparse Representations 


Signals carry overwhelming amounts of data in which relevant information 
is often more difficult to find than a needle in a haystack. Processing is 
faster and simpler in a sparse representation where few coefficients reveal 
the information we are looking for. Such representations can be constructed 
by decomposing signals over elementary waveforms chosen in a family 
called a dictionary. But the search for the Holy Grail of an ideal sparse 
transform adapted to all signals is a hopeless quest. The discovery of 
wavelet orthogonal bases and local time-frequency dictionaries has opened 
the door to a huge jungle of new transforms. Adapting sparse 
representations to signal properties, and deriving efficient processing 
operators, is therefore a necessary survival strategy. 


An orthogonal basis is a dictionary of minimum size that can yield a sparse 
representation if designed to concentrate the signal energy over a set of few 
vectors. This set gives a geometric signal description. Efficient signal 
compression and noise-reduction algorithms are then implemented with 
diagonal operators computedwith fast algorithms. But this is not always 
optimal. 


In natural languages, a richer dictionary helps to build shorter and more 
precise sentences. Similarly, dictionaries of vectors that are larger than 
bases are needed to build sparse representations of complex signals. But 
choosing is difficult and requires more complex algorithms. Sparse 
representations in redundant dictionaries can improve pattern recognition, 
compression, and noise reduction, but also the resolution of new inverse 
problems. This includes superresolution, source separation, and 
compressive sensing. 


This first chapter is a sparse book representation, providing the story line 
and the main ideas. It gives a sense of orientation for choosing a path to 
travel. 


Computational Harmonic Analysis 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
Processing, The Sparse Way (third edition, 2009) by Stéphane Mallat. The 
book's website at Academic Press is 
http://www.elsevier.com/wps/find/bookdescription.cws_home/714561/descr 
iption#description The book's complementary materials are available at 
http://wavelet-tour.com 


Fourier and wavelet bases are the journey's starting point. They decompose 
signals over oscillatory waveforms that reveal many signal properties and 
provide a path to sparse representations. Discretized signals often have a 
very large size , and thus can only be processed by fast algorithms, 
typically implemented with operations and memories. Fourier 
and wavelet transforms illustrate the strong connection between well- 
structured mathematical tools andfast algorithms. 


The Fourier Kingdom 


The Fourier transform is everywhere in physics and mathematics because it 
diagonalizes time-invariant convolution operators. It rules over linear time- 
invariant signal processing, the building blocks of which are frequency 
filtering operators. 


Fourier analysis represents any finite energy function as a sum of 
sinusoidal waves 

Equation: 

The amplitude of each sinusoidal wave is equal to its correlation 


with _, also called Fourier transform: 
Equation: 


The more regular , the faster the decay of the sinusoidal wave 


amplitude when frequency q@ increases. 


When is defined only on an interval, say , then the Fourier 
transform becomes a decomposition in a Fourier orthonormal basis 
of mG is uniformly regular, then its Fourier 


transform coefficients also have a fast decay when the frequency 

increases, so it can be easily approximated with few low-frequency Fourier 
coefficients. The Fourier transform therefore defines a sparse representation 
of uniformly regular functions. 


Over discrete signals, the Fourier transform is a decomposition in a discrete 
orthogonal Fourier basis of — , which has properties 


similar to a Fourier transform on functions. Its embedded structure leads to 
fast Fourier transform (FFT) algorithms, which compute discrete Fourier 
coefficients with instead of N’. This FFT algorithm is a 
cornerstone of discrete signal processing. 


As long as we are satisfied with linear time-invariant operators or uniformly 
regular signals, the Fourier transform provides simple answers to most 
questions. Its richness makes it suitable for a wide range of applications 
such as signal transmissions or stationary signal processing. However, to 
represent a transient phenomenon—a word pronounced at a particular time, 
an apple located in the left corner of an image—the Fourier transform 
becomes a cumbersome tool that requires many coefficients to represent a 
localized event. Indeed, the support of covers the whole real line, so 


depends on the values for all times . This global “mix” of 
information makes it difficult to analyze or represent any local property of 


from 


Wavelet Bases 


Wavelet bases, like Fourier bases, reveal the signal regularity through the 
amplitude of coefficients, and their structure leads to a fast computational 
algorithm. However, wavelets are well localized and few coefficients are 
needed to represent local transient structures. As opposed to a Fourier basis, 
a wavelet basis defines a sparse representation of piecewise regular signals, 
which may include transients and singularities. In images, large wavelet 
coefficients are located in the neighborhood of edges and irregular textures. 


The story began in 1910, when Haar (Haar:10) constructed a piecewise 
constantfunction 
Equation: 


the dilations and translations of which generate an orthonormal basis 
Equation: 


of the space of signals having a finite energy 
Equation: 
Let us write —the inner product in 


Any finite energy signal _can thus be represented by its wavelet inner- 
product coefficients 
Equation: 


and recovered by summing them in this wavelet orthonormal basis: 
Equation: 


Each Haar wavelet has a zero average over its support 
.If  islocally regular and is small, then it is nearly 
constant over this interval and the wavelet coefficient is nearly 


zero. This means that large wavelet coefficients are located at sharp signal 
transitions only. 


With a jump in time, the story continues in 1980, when Strémberg 
(Stromberg:81) found a piecewise linear function w that also generates an 
orthonormal basis and gives better approximations of smooth functions. 
Meyer was not aware of this result, and motivated by the work of Morlet 
and Grossmann over continuous wavelet transform, he tried to prove that 
there exists no regular wavelet w that generates an orthonormal basis. This 
attempt was a failure since he ended up constructing a whole family of 
orthonormal wavelet bases, with functions w that are infinitely continuously 
differentiable (Meyer:86). This was the fundamental impulse that led to a 
widespread search for new orthonormal wavelet bases, which culminated in 
the celebrated Daubechies wavelets of compact support (Daubechies:88). 


The systematic theory for constructing orthonormal wavelet bases was 
established by Meyer and Mallat through the elaboration of multiresolution 
signal approximations (Mallat:89b), as presented in Chapter 7. It was 
inspired by original ideas developed in computer vision by Burt and 
Adelson (BurtA:83) to analyze images at several resolutions. Digging 
deeper into the properties of orthogonal wavelets and multiresolution 
approximations brought to light a surprising link with filter banks 


constructed with conjugate mirror filters, and a fast wavelet transform 
algorithm decomposing signals of size N with operations 
(Mallat:89). 


Filter Banks 


Motivated by speech compression, in 1976 Croisier, Esteban, and Galand 
(CroisierEG:76) introduced an invertible filter bank, which decomposes a 
discrete signal into two signals of half its size using a filtering and 
subsampling procedure. They showed that can be recovered from 
these subsampled signals by canceling the aliasing terms with a particular 
class of filters called conjugate mirror filters. This breakthrough led to a 10- 
year research effort to build a complete filter bank theory. Necessary and 
sufficient conditions for decomposing a signal in subsampled components 
with a filtering scheme, and recovering the same signal with an inverse 
transform, were established by Smith and Barnwell (SmithB:84), 
Vaidyanathan (Vaidyanathan:87), andVetterli (Vetterli:86). 


The multiresolution theory of Mallat (Mallat:89b) and Meyer (Meyer:92c) 
proves that any conjugate mirror filter characterizes a wavelet wW that 
generates an orthonormal basis of , and that a fast discrete wavelet 
transform is implemented by cascading these conjugate mirror filters 
(Mallat:89). The equivalence between this continuous time wavelet theory 
and discrete filter banks led to a new fruitful interface between digital 
signal processing and harmonic analysis, first creating a culture shock that 
is now well resolved. 


Continuous versus Discrete and Finite 


Originally, many signal processing engineers were wondering what is the 
point of considering wavelets and signals as functions, since all 
computations are performed over discrete signals with conjugate mirror 
filters. Why bother with the convergence of infinite convolution cascades if 
in practice we only compute a finite number of convolutions? Answering 


these important questions is necessary in order to understand why this book 
alternates between theorems on continuous time functions and discrete 
algorithms applied to finite sequences. 


A short answer would be “simplicity.” In , a wavelet basis is 
constructedby dilating and translating a single function w. Several important 
theorems relate the amplitude of wavelet coefficients to the local regularity 
of the signal . Dilations are not defined over discrete sequences, and 
discrete wavelet bases are therefore more complex to describe. The 
regularity of a discrete sequence is not well defined either, which makes it 
more difficult to interpret the amplitude of wavelet coefficients. A theory of 
continuous-time functions gives asymptotic results for discrete sequences 
with sampling intervals decreasing to zero. This theory is useful because 
these asymptotic results are precise enough to understand the behavior of 
discrete algorithms. 


But continuous time or space models are not sufficient for elaborating 
discrete signal-processing algorithms. The transition between continuous 
and discrete signals must be done with great care to maintain important 
properties such as orthogonality. Restricting the constructions to finite 
discrete signals adds another layer of complexity because of border 
problems. How these border issues affect numerical implementations is 
carefully addressed once the properties of the bases are thoroughly 
understood. 


Wavelets for Images 


Wavelet orthonormal bases of images can be constructed from wavelet 
orthonormal bases of one-dimensional signals. Three mother wavelets 

: , and , with , are dilated by and 
translated by with . This yields an orthonormal 
basis of the space of finite energy functions 
Equation: 


The support of a wavelet is a square of width proportional to the scale 


. Two-dimensional wavelet bases are discretized to define orthonormal 
bases of images including N pixels. Wavelet coefficients are calculated with 


the fast algorithm described in Chapter 7. 
Like in one dimension, a wavelet coefficient has a small 
amplitude if is regular over the support of . It has a large 


amplitude near sharp transi-tions such as edges. Figure (b) is the array of N 
wavelet coefficients. Each direction k and scale corresponds to a 
subimage, which shows in black the position of the largest coefficients 
above a threshold: 


(a) Discrete image of pixels. (b) Array of N 


orthogonal wavelet coefficients for , and 4 
scales _ ; black points correspond to . (c) Linear 
approximation from the wavelet coefficients at the three 


largest scales. (d) Nonlinear approximation from the 
wavelet coefficients of largest amplitude shown in (b). 


Approximation and Processing in Bases 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
Processing, The Sparse Way (third edition, 2009) by Stéphane Mallat. The 
book's website at Academic Press is 
http://www.elsevier.com/wps/find/bookdescription.cws_home/714561/descripti 
on#description The book's complementary materials are available at 
http://wavelet-tour.com 


Analog-to-digital signal conversion is the first step of digital signal processing. 
Chapter 3 explains that it amounts to projecting the signal over a basis of an 
approximation space. Most often, the resulting digital representation remains 
much too large and needs to be further reduced. A digital image typically 
includes more than 10° samples and a CD music recording has 40 x 10° 
samples per second. Sparse representations that reduce the number of 
parameters can be obtained by thresholding coefficients in an appropriate 
orthogonal basis. Efficient compression and noise-reduction algorithms are then 
implemented with simple operators in this basis. 


Stochastic versus Deterministic Signal Models 


A representation is optimized relative to a signal class, corresponding to all 
potential signals encountered in an application. This requires building signal 
models that carry available prior information. 


A signal f can be modeled as a realization of arandom process F’, the 
probability distribution of which is known a priori. A Bayesian approach then 
tries to minimize the expected approximation error. Linear approximations are 
simpler because they only depend on the covariance. Chapter 9 shows that 
optimal linear approximations are obtained on the basis of principal 
components that are the eigenvectors of the covariance matrix. However, the 
expected error of nonlinear approximations depends on the full probability 
distribution of F’. This distribution is most often not known for complex 
signals, such as images or sounds, because their transient structures are not 
adequately modeled as realizations of known processes such as Gaussian ones. 


To optimize nonlinear representations, weaker but sufficiently powerful 
deterministic models can be elaborated. A deterministic model specifies a set 0, 
where the signal belongs. This set is defined by any prior information—for 
example, on the time-frequency localization of transients in musical recordings 


or on the geometric regularity of edges in images. Simple models can also 
define © as a ball in a functional space, with a specific regularity norm such as 
a total variation norm. A stochastic model is richer because it provides the 
probability distribution in ©. When this distribution is not available, the 
average error cannot be calculated and is replaced by the maximum error over 
©. Optimizing the representation then amounts to mini-mizing this maximum 
error, which is called a minimax optimization. 


Sampling with Linear Approximations 


Analog-to-digital signal conversion is most often implemented with a linear 
approximation operator that filters and samples the input analog signal. From 
these samples, a linear digital-to-analog converter recovers a projection of the 
original analog signal over an approximation space whose dimension depends 
on the sampling density. Linear approximations project signals in spaces of 
lowest possible dimensions to reduce computations and storage cost, while 
controlling the resulting error. 


Sampling Theorems 


= eee P 
Let us consider finite energy signals || f ||?= [ | f («)| dz of finite support, 


which is normalized to [0, 1] or [0, 1) for images. A sampling process 


implements a filtering of f (x) with a low-pass impulse response @, (x) anda 
uniform sampling to output a discrete signal: 
Equation: 


finJ= f *@; (ns) for 0<n<QN. 


In two dimensions, n = (1,2) and x = (21, £2). These filtered samples can 
also be written as inner products: 
Equation: 


Fx Ge(ns)= f Fu) Fe(ns—u)du= ( f(e)¥. (ens) ) 


with y, (x) = @,; (—ax). Chapter 3 explains that @, is chosen, like in the classic 
Shannon—Whittaker sampling theorem, so that a family of functions 

{ps (« — ns)},~,,<y is a basis of an appropriate approximation space Uy. The 
best linear approximation of f in Uy recovered from these samples is the 
orthogonal projection f y of f in Uy, and if the basis is orthonormal, then 
Equation: 


A sampling theorem states that if f € Uy then f = fy sorecovers f (x) 


from the measured samples. Most often, f does not belong to this 
approximation space. It is called aliasing in the context of Shannon—Whittaker 
sampling, where Uy is the space of functions having a frequency support 
restricted to the N lower frequencies. The approximation error || f — f n ||? 
must then be controlled. 


Linear Approximation Error 


The approximation error is computed by finding an orthogonal basis 
B = {Gm (£) }o<me+oo of the whole analog signal space L? (R)(0, 1)’, with 
the first N vector {9m (£)}o<m<y that defines an orthogonal basis of Uy. Thus, 


the orthogonal projection on Uy can be rewritten as 
Equation: 


f(x) = ¥( F.dm) a (x). 


=0 


Since f = er ie Gm) 9m, the approximation error is the energy of the 


removed inner products: 
Equation: 


+00 
e. (N, f) =|| f—-fwy \|?= S° ae 
m=N 


This error decreases quickly when N increases if the coefficient amplitudes 
|( £;Gm )| have a fast decay when the index m increases. The dimension N is 
adjusted to the desired approximation error. 


Figure (a) shows a discrete image f[n]| approximated with N = 2567 pixels. 
Figure (c) displays a lower-resolution image fy /1¢ projected on a space U wy /1¢ 
of dimension N/16, generated by N/16 large-scale wavelets. It is calculated 
by setting all the wavelet coefficients to zero at the first two smaller scales. The 
approximation error is || f—fi/ie ||? /|| f |? = 14 x 10-. Reducing the 
resolution introduces more blur and errors. A linear approximation space Uy 
corresponds to a uniform grid that approximates precisely uniform regular 
signals. Since images f are often not uniformly regular, it is necessary to 
measure it at a high-resolution N. This is why digital cameras have a resolution 
that increases as technology improves. 


Sparse Nonlinear Approximations 


Linear approximations reduce the space dimensionality but can introduce 
important errors when reducing the resolution if the signal is not uniformly 
regular, as shown by Figure (c). To improve such approximations, more 
coefficients should be kept where needed—not in regular regions but near sharp 
transitions and edges.This requires defining an irregular sampling adapted to 
the local signal regularity. This optimized irregular sampling has a simple 
equivalent solution through nonlinear approximations in wavelet bases. 


Nonlinear approximations operate in two stages. First, a linear operator 
approximates the analog signal f with N samples written 

f |n] = f * ©, (ns). Then, a nonlinear approximation of f[n] is computed 
to reduce the N coefficients f|n] to M < N coefficients in a sparse 
representation. 


The discrete signal f can be considered as a vector of C™’. Inner products and 
norms in C% are written 


Equation: 


(4) =  flnjg"[n| and |i fP? = Fin 


n=0 


To obtain a sparse representation with a nonlinear approximation, we choose a 
new orthonormal basis B = {gm [n]},,<p of C%, which concentrates the 
signal energy as much as possible over few coefficients. Signal coefficients 

{( f,9m)}mep are computed from the N input sample values f[n] with an 
orthogonal change of basis that takes N* operations in nonstructured bases. In a 
wavelet or Fourier bases, fast algorithms require, respectively, O(.V) and 
O(N log, N) operations. 


Approximation by Thresholding 


For M < N, an approximation fj is computed by selecting the “best” 
M < N vectors within &. The orthogonal projection of f on the space V) 
generated by M vectors {9m},,cq in B is 

Equation: 


= (sam) on 


mex 


Since f = er be gm) Ym, the resulting error is 
Equation: 


tpl = liga 


m¢r 


We write |A| the size of the set A. The best MM = |A| term approximation, which 
minimizes || f—f, ||, is thus obtained by selecting the M coefficients of 


largest amplitude. These coefficients are above a threshold T that depends on 
M: 
Equation: 


fu = frr = S- f,9m Gm with Ar= {Mey : Cf, 9m ata e 


meAr 


This approximation is nonlinear because the approximation set A; changes with 
f. The resulting approximation error is: 
Equation: 


én (M, f) =|| f—fu ?= S© Cf, 9m)I’. 


m¢Ar 


[link](b) shows that the approximation support A7 of an image in a wavelet 
orthonormal basis depends on the geometry of edges and textures. Keeping 
large wavelet coefficients is equivalent to constructing an adaptive 
approximation grid specified by the scale—space support Ar. It increases the 
approximation resolution where the signal is irregular. The geometry of A; 
gives the spatial distribution of sharp image transitions and edges, and their 
propagation across scales. Chapter 6 proves that wavelet coefficients give 
important information about singularities and local Lipschitz regularity. This 
example illustrates how approximation support provides “geometric” 
information on f, relative to a dictionary, that is a wavelet basis in this 
example. 


[link](d) gives the nonlinear wavelet approximation fy recovered from the 

M = N/16 large-amplitude wavelet coefficients, with an error 

|| f-fac ||? /|| £ ||? =5 x 10°°. This error is nearly three times smaller than 
the linear approximation error obtained with the same number of wavelet 
coefficients, and the image quality is much better. 


An analog signal can be recovered from the discrete nonlinear approxima-tion 
fu: 
Equation: 


Since all projections are orthogonal, the overall approximation error on the 
original analog signal f (x) is the sum of the analog sampling error and the 
discrete nonlinear error: 

Equation: 


| fF — fa P= fF - Fw lP +I f- far PP =e (Ny f) +n (M, f). 


In practice, N is imposed by the resolution of the signal-acquisition hardware, 
and M is typically adjusted so that e, (M, f) > «1: (N, f). 


Sparsity with Regularity 


Sparse representations are obtained in a basis that takes advantage of some 
form of regularity of the input signals, creating many small-amplitude 
coefficients. Since wavelets have localized support, functions with isolated 
singularities produce few large-amplitude wavelet coefficients in the 
neighborhood of these singularities. Nonlinear wavelet approximation produces 
a small error over spaces of functions that do not have “too many” sharp 
transitions and singularities. Chapter 9 shows that functions having a bounded 
total variation norm are useful models for images with nonfractal (finite length) 
edges. 


Edges often define regular geometric curves. Wavelets detect the location of 
edges but their square support cannot take advantage of their potential 
geometric regularity. More sparse representations are defined in dictionaries of 
curvelets or bandlets, which have elongated support in multiple directions, that 
can be adapted to this geometrical regularity. In such dictionaries, the 
approximation support A; is smaller but provides explicit information about 
edges' local geometrical properties such as their orientation. In this context, 
geometry does not just apply to multidimensional signals. Audio signals, such 
as musical recordings, also have a complex geometric regularity in time- 
frequency dictionaries. 


Compression 


Storage limitations and fast transmission through narrow bandwidth channels 
require compression of signals while minimizing degradation. Transform codes 
compress signals by coding a sparse representation. Chapter 10 introduces the 
information theory needed to understand these codes and to optimize their 
performance. 


In a compression framework, the analog signal has already been discretized 
into a signal f|n] of size N. This discrete signal is decomposed in an 
orthonormal basis @ = {gm},,<p of C™: 

Equation: 


f= (han) 


mel 


Coefficients ( f, gm) are approximated by quantized values Q(( f, gm)). If Q is 
auniform quantizer of step A, then |x — Q(a)| < A/2; and if || < A/2, then 
Q(x) = 0. The signal f restored from quantized coefficients is 

Equation: 


F= 32 Q((F,9:)) Gm: 


mel 


An entropy code records these coefficients with R bits. The goal is to minimize 
the signal-distortion rate d(R, f) =|| f—f ||”. 


The coefficients not quantized to zero correspond to the set 

Ar ={mevy:|(f, 9m) |> T} with T = A/2. For sparse signals, Chapter 10 
shows that the bit budget R is dominated by the number of bits to code A; in y, 
which is nearly proportional to its size |r|. This means that the “information” 
about a sparse representation ismostly geometric. Moreover, the distortion is 
dominated by the nonlinear approximation error || f—fa, ||?, for 

i) oe v4 7; Gn) Qm- Compression is thus a sparse approximation 
problem. For a given distortion d(R, f), minimizing R requires reducing |A7| 
and thus optimizing the sparsity. 


The number of bits to code A; can take advantage of any prior information on 
the geometry. [link](b) shows that large wavelet coefficients are not randomly 
distributed. They have a tendency to be aggregated toward larger scales, and at 
fine scales they are regrouped along edge curves or in texture regions. Using 
such prior geometric models is a source of gain in coders such as JPEG-2000. 


Chapter 10 describes the implementation of audio transform codes. Image 
transform codes in block cosine bases and wavelet bases are introduced, 
together with the JPEG and JPEG-2000 compression standards. 


Denoising 


Signal-acquisition devices add noise that can be reduced by estimators using 
prior information on signal properties. Signal processing has long remained 
mostly Bayesian and linear. Nonlinear smoothing algorithms existed in 
Statistics, but these procedures were often ad hoc and complex. Two 
statisticians, Donoho andJohnstone (DonohoJ:94), changed the “game” by 
proving that simple thresholding in sparse representations can yield nearly 
optimal nonlinear estimators. This was the beginning of a considerable 
refinement of nonlinear estimation algorithms that is still ongoing. 


Let us consider digital measurements that add a random noise W|n] to the 
original signal f[n/!: 
Equation: 


X|n| = f[n]}+W[n| for 0<n<QN. 
The signal f is estimated by transforming the noisy data X with an operator D: 
Equation: 
F= DX. 
The risk of the estimator F of f is the average error, calculated with respect to 


the probability distribution of noise W: 
Equation: 


r(D, f) = E{|| f — DX ||"}. 


Bayes versus Minimax 


To optimize the estimation operator D, one must take advantage of prior 
information available about signal f. In a Bayes framework, f is considered a 
realization of a random vector F' and the Bayes risk is the expected risk 
calculated with respect to the prior probability distribution m of the random 
signal model F’: 

Equation: 


(Dyn) = EAD F)}. 


Optimizing D among all possible operators yields the minimum Bayes risk: 
Equation: 


—inf r(D,7). 
Tn, (7) inf r ( , 1) 


In the 1940s, Wald brought in a new perspective on statistics with a decision 
theory partly imported from the theory of games. This point of view uses 
deterministic models, where signals are elements of a set ©, without specifying 
their probability distribution in this set. To control the risk for any f € O, we 
compute the maximum risk: 

Equation: 


r (D, QO) =sup r (D, f). 
feo 


The minimax risk is the lower bound computed over all operators D: 
Equation: 


rn (O) = inf r(D,@). 


In practice, the goal is to find an operator D that is simple to implement and 
yields a risk close to the minimax lower bound. 


(a) (b) 


Thresholding Estimators 


It is tempting to restrict calculations to linear operators D because of their 
simplicity. Optimal linear Wiener estimators are introduced in Chapter 11. 
Figure (a) is an image contaminated by Gaussian white noise. Figure (b) shows 
an optimized linear filtering estimation F = X + h [nl], which is therefore 
diagonal in a Fourier basis &. This convolution operator averages the noise but 
also blurs the image and keeps low-frequency noise by retaining the image's 
low frequencies. 


If f has a sparse representation in a dictionary, then projecting X on the vectors 
of this sparse support can considerably improve linear estimators. The difficulty 
is identifying the sparse support of f from the noisy data X. Donoho and 
Johnstone (DonohoJ:94) proved that, in an orthonormal basis, a simple 
thresholding of noisy coefficients does the trick. Noisy signal coefficients in an 
orthonormal basis¥ = {gm} ep are 

Equation: 


(X, 9m) =( 59m) + (W, 9m) for mevy. 


Thresholding these noisy coefficients yields an orthogonal projection estimator 
Equation: 


F=X, = Y (0m) a with Ap={mey : (an) 7 


méeAr 


The set A is an estimate of an approximation support of f. It is hopefully 
close to the optimal approximation supportAr = {mevy : |(f,gm)|> Th. 


[link](b) shows the estimated approximation set Ny of noisy-wavelet 
coefficients, |(X, ~;,|= 7’, that can be compared to the optimal approximation 
support Az; shown in [link](b). The estimation in [link](d) from wavelet 
coefficients in Ar has considerably reduced the noise in regular regions while 
keeping the sharpness of edges by preserving large-wavelet coefficients. This 
estimation is improved with a translation-invariant procedure that averages this 
estimator over several translated wavelet bases. Thresholding wavelet 
coefficients implements an adaptive smoothing, which averages the data X with 
a kernel that depends on the estimated regularity of the original signal f. 


Donoho and Johnstone proved that for Gaussian white noise of variance o7, 


choosing T’' = 0/2 log, N yields a risk E { | f= Pr} of the order of 


| f=fan I", toa log, N factor. This spectacular result shows that the 


estimated support Ar does nearly as well as the optimal unknown support Ar. 
The resulting risk is small if the representation is sparse and precise. 


The set Ar in [link](b) “looks” different from the A; in [link](b) because it has 
more isolated points. This indicates that some prior information on the 
geometry of A; could be used to improve the estimation. For audio noise- 
reduction, thresholding estimators are applied in sparse representations 
provided by time-frequency bases. Similar isolated time-frequency coefficients 
produce a highly annoying “musical noise.” Musical noise is removed with a 


block thresholding that regularizes the geometry of the estimated support Ar 
and avoids leaving isolated points. Block thresholding also improves wavelet 
estimators. 


If W is a Gaussian noise and signals in @ have a sparse representation in &, 
then Chapter 11 proves that thresholding estimators can produce a nearly 
minimax risk. In particular, wavelet thresholding estimators have a nearly 
minimax risk for large classes of piecewise smooth signals, including bounded 
variation images. 


Time-Frequency Dictionaries 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
Processing, The Sparse Way (third edition, 2009) by Stéphane Mallat. The 
book's website at Academic Press is 
http://www.elsevier.com/wps/find/bookdescription.cws_home/714561/descripti 
on#description The book's complementary materials are available at 
http://wavelet-tour.com 


Motivated by quantum mechanics, in 1946 the physicist Gabor (Gabor:46) 
proposed decomposing signals over dictionaries of elementary waveforms 
which he called time-frequency atoms that have a minimal spread in a time- 
frequency plane. By showing that such decompositions are closely related to 
our perception of sounds, and that they exhibit important structures in speech 
and music recordings, Gabor demonstrated the importance of localized time- 
frequency signal processing. Beyond sounds, large classes of signals have 
sparse decompositions as sums of time-frequency atoms selected from 
appropriate dictionaries. The key issue is to understand how to construct 
dictionaries with time-frequency atoms adapted to signal properties. 


Heisenberg Uncertainty 


Heisenberg box representing 
an atom Qy. 


A time-frequency dictionary is composed of waveforms of unit 
norm , which have a narrow localization in time and frequency. The 


time localization u of gy and its spread around u, are defined by 
Equation: 


Similarly, the frequency localization and spread of _ are defined by 
Equation: 


The Fourier Parseval formula 


Equation: 
shows that depends mostly on the values and , where 
and are nonnegligible , and hence for in a rectangle centered at 
, of size . This rectangle is illustrated by [link] in this time- 
frequency plane . It can be interpreted as a “quantum of information” over 


an elementary resolution cell. The uncertainty principle theorem proves (see 
Chapter 2) that this rectangle has a minimum surface that limits the joint time- 
frequency resolution: 

Equation: 


Constructing a dictionary of time-frequency atoms can thus be thought of as 
covering the time-frequency plane with resolution cells having a time width 

anda frequency width which may vary but with a surface larger than 
one-half. Windowed Fourier and wavelet transforms are two important 
examples. 


Windowed Fourier Transform 


A windowed Fourier dictionary is constructed by translating in time and 


frequency a time window , of unit norm , centered at 
Equation: 
The atom is translated by u in time and by é in frequency. The time-and- 
frequency spread of is independent of u and €. This means that each atom 
corresponds to a Heisenberg rectangle that has a size independent 
of its position , as shown by [link]. 
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Time-frequency boxes (“Heisenberg rectangles”) 
representing the energy spread of two windowed 
Fourier atoms. 


The windowed Fourier transform projects on each dictionary atom 
Equation: 


It can be interpreted as a Fourier transform of at the frequency €, localized by 
the window in the neighborhood of u. This windowed Fourier 
transform is highly redundant and represents one-dimensional signals by a 
time-frequency image in . It is thus necessary to understand how to select 
many fewer time-frequency coefficients that represent the signal efficiently. 


When listening to music, we perceive sounds that have a frequency that varies 
in time. Chapter 4 shows that a spectral line of | creates high-amplitude 
windowed Fourier coefficients at frequencies that depend on 
time u. These spectral components are detected and characterized by ridge 
points, which are local maxima in this time-frequency plane. Ridge points 
define a time-frequency approximation support A of | with a geometry that 
depends on the time-frequency evolution of the signal spectral components. 
Modifying the sound duration or audio transpositions are implemented by 
modifying the geometry of the ridge support in time frequency. 


A windowed Fourier transform decomposes signals over waveforms that have 
the same time and frequency resolution. It is thus effective as long as the signal 
does not include structures having different time-frequency resolutions, some 
being very localized in time and others very localized in frequency. Wavelets 
address this issue by changing the time and frequency resolution. 


Continuous Wavelet Transform 


In reflection seismology, Morlet knew that the waveforms sent underground 
have a duration that is too long at high frequencies to separate the returns of 
fine, closely spaced geophysical layers. Such waveforms are called wavelets in 
geophysics. Instead of emitting pulses of equal duration, he thought of sending 
shorter waveforms at high frequencies. These waveforms were obtained by 
scaling the mother wavelet, hence the name of this transform. Although 
Grossmann was working in theoretical physics, he recognized in Morlet's 


approach some ideas that were close to his own work on coherent quantum 
states. 


Nearly forty years after Gabor, Morlet and Grossmann reactivated a 
fundamental collaboration between theoretical physics and signal processing, 
which led to the formalization of the continuous wavelet transform 
(GrossmannM:84). These ideas were not totally new to mathematicians 
working in harmonic analysis, or to computer vision researchers studying 
multiscale image processing. It was thus only the beginning of a rapid catalysis 
that brought together scientists with very different backgrounds. 


A wavelet dictionary is constructed from a mother wavelet w of zero average 
Equation: 


which is dilated with a scale parameter s, and translated by u: 
Equation: 


The continuous wavelet transform of at any scale s and position u is the 
projection of on the corresponding wavelet atom: 
Equation: 


It represents one-dimensional signals by highly redundant time-scale images in 


Varying Time-Frequency Resolution 


As opposed to windowed Fourier atoms, wavelets have a time-frequency 
resolution that changes. The wavelet has a time support centered at u and 


proportional to s. Let us choose a wavelet w whose Fourier transform is 
nonzero in a positive frequency interval centered at n. The Fourier transform 
is dilated by and thus is localized in a positive frequency interval 
centered at ;itssizeisscaled by _ . Inthe time-frequency plane, the 
Heisenberg box of a wavelet atom is therefore a rectangle centered at 
, with time and frequency widths, respectively, proportional to s and 
. When s varies, the time and frequency width of this time-frequency 
resolution cell changes, but its area remains constant, as illustrated by [link]. 


Large-amplitude wavelet coefficients can detect and measure short high- 
frequency variations because they have a narrow time localization at high 
frequencies. At low frequencies their time resolution is lower, but they have a 
better frequency resolution. This modification of time and frequency resolution 
is adapted to represent sounds with sharp attacks, or radar signals having a 
frequency that may vary quickly at high frequencies. 


Multiscale Zooming 


A wavelet dictionary is also adapted to analyze the scaling evolution of 
transients with zooming procedures across scales. Suppose now that w is real. 
Since it has a zero average, a wavelet coefficient measures the 
variation of ina neighborhood of u that has a size proportional to s. Sharp 
signal transitions create large-amplitude wavelet coefficients. 


Heisenberg time-frequency boxes of two 
wavelets, and . When the scale s 
decreases, the time support is reduced but the 
frequency spread increases and covers an 
interval that is shifted toward high frequencies. 


Signal singularities have specific scaling invariance characterized by Lipschitz 
exponents. Chapter 6 relates the pointwise regularity of | to the asymptotic 
decay of the wavelet transform amplitude when s goes to zero. 
Singularities are detected by following the local maxima of the wavelet 
transform acrossscales. 


In images, wavelet local maxima indicate the position of edges, which are sharp 
variations of image intensity. It defines scale-space approximation support of 
from which precise image approximations are reconstructed. At different scales, 
the geometry of this local maxima support provides contours of image 
structures of varying sizes. This multiscale edge detection is particularly 
effective for pattern recognition in computer vision (Canny:86). 


The zooming capability of the wavelet transform not only locates isolated 
singular events, but can also characterize more complex multifractal signals 
having nonisolated singularities. Mandelbrot (Mandelbrot:82) was the first to 
recognize the existence of multifractals in most corners of nature. Scaling one 
part of a multifractal produces a signal that is statistically similar to the whole. 
This self-similarity appears in the continuous wavelet transform, which 
modifies the analyzing scale. From global measurements of the wavelet 
transform decay, Chapter 6 measures the singularity distribution of 
multifractals. This is particularly important in analyzing their properties and 
testing multifractal models in physics or in financial time series. 


Time-Frequency Orthonormal Bases 


Orthonormal bases of time-frequency atoms remove all redundancy and define 
stable representations. A wavelet orthonormal basis is an example of the time- 


frequency basis obtained by scaling a wavelet w with dyadic scales and 
translating it by , which is written . In the time-frequency plane, the 
Heisenberg resolution box of isadilation by and translation by of 


the Heisenberg box of w. A wavelet orthonormal is thus a subdictionary of the 
continuous wavelet transform dictionary, which yields a perfect tiling of the 
time-frequency plane illustrated in [link]. 
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The time-frequency boxes of a wavelet basis 
define a tiling of the time-frequency plane. 


One can construct many other orthonormal bases of time-frequency atoms, 
corresponding to different tilings of the time-frequency plane. Wavelet packet 
and local cosine bases are two important examples constructed in Chapter 8, 
with time-frequency atoms that split the frequency and the time axis, 
respectively, in intervals of varying sizes. 


Wavelet Packet Bases 


Wavelet bases divide the frequency axis into intervals of 1 octave bandwidth. 
Coifman, Meyer, and Wickerhauser (CoifmanMW:92) have generalized this 
construction with bases that split the frequency axis in intervals of bandwidth 
that may be adjusted. Each frequency interval is covered by the Heisenberg 
time-frequency boxes of wavelet packet functions translated in time, in order to 
cover the whole plane, as shown by [link]. 


As for wavelets, wavelet-packet coefficients are obtained with a filter bank of 
conjugate mirror filters that split the frequency axis in several frequency 
intervals. Different frequency segmentations correspond to different wavelet 
packet bases. For images, a filter bank divides the image frequency support in 
squares of dyadic sizes that can be adjusted. 


A wavelet packet basis divides the 
frequency axis in separate intervals of 
varying sizes. A tiling is obtained by 
translating in time the wavelet packets 

covering each frequency interval. 


Local Cosine Bases 


Local cosine orthonormal bases are constructed by dividing the time axis 
instead of the frequency axis. The time axis is segmented in successive 
intervals . The local cosine bases of Malvar (Malvar:88) are obtained 
by designing smooth windows that cover each interval , and by 
multiplying them by cosine functions of different frequencies. 
This is yet another idea that has been independently studied in physics, signal 
processing, and mathematics. Malvar's original construction was for discrete 
signals. At the same time, the physicist Wilson (Wilson:87) was designing a 
local cosine basis, with smooth windows of infinite support, to analyze the 
properties of quantum coherent states. Malvar bases were also rediscovered and 
generalized by the harmonic analysts Coifman and Meyer (CoifmanM:91). 
These different views of the same bases brought to light mathematical and 
algorithmic properties that opened new applications. 


A multiplication by translates the Fourier transform of 

by _ . Over positive frequencies, the time-frequency box of the 
modulated window is therefore equal to the time-frequency 
box of g, translated by € along frequencies. [link] shows the time-frequency 
tiling corresponding to such a local cosine basis. For images, a two-dimensional 
cosine basis is constructed by dividing the image support in squares of varying 
sizes. 


Sparsity in Redundant Dictionaries 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
Processing, The Sparse Way (third edition, 2009) by Stéphane Mallat. The 
book's website at Academic Press is 
http://www.elsevier.com/wps/find/bookdescription.cws_home/714561/descr 
iption#description The book's complementary materials are available at 
http://wavelet-tour.com 


In natural languages, large dictionaries are needed to refine ideas with short 
sentences, and they evolve with usage. Eskimos have eight different words 
to describe snow quality, whereas a single word is typically sufficient in a 
Parisian dictionary.Similarly, large signal dictionaries of vectors are needed 
to construct sparse representations of complex signals. However, computing 
and optimizing a signal approximation by choosing the best M dictionary 
vectors is much more difficult. 
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A local cosine basis divides the time axis with 
smooth windows g, (t) and translates these 


windows into frequency. 


Frame Analysis and Synthesis 


Suppose that a sparse family of vectors {Pp}, a, has been selected to 


approximate a signal f. An approximation can be recovered as an 
orthogonal projection in the space V, generated by these vectors. We then 
face one of the following twoproblems. 


1. In a dual-synthesis problem, the orthogonal projection f, of f in Vy 


must be computed from dictionary coefficients, {( f, yp) in as 


provided by an analysis operator. This is the case when a signal 
transform {( f, pp) } per iS calculated in some large dictionary and a 


subset of inner products are selected. Such inner products may 
correspond to coefficients above a threshold or local maxima values. 

2. In a dual-analysis problem, the decomposition coefficients of f, must 
be computed on a family of selected vectors {Pp}, <a: This problem 
appears when sparse representation algorithms select vectors as 
opposed to inner products. This is the case for pursuit algorithms, 
which compute approximation supports in highly redundant 
dictionaries. 


The frame theory gives energy equivalence conditions to solve both 
problems with stable operators. A family {Pp}, <a is a frame of the space V 


it generates if there exists B > A > 0 such that 
Equation: 


WREV, All hl’ < Sol(h, gy)? < Bi hI’. 


mex 


The representation is stable since any perturbation of frame coefficients 
implies a modification of similar magnitude on h. Chapter 5 proves that the 


existence of a dual frame {Pp}, ca that solves both the dual-synthesis and 


dual-analysisproblems: 
Equation: 


p= > ie rn) Pp = >( 1) Pp: 


per per 


Algorithms are provided to calculate these decompositions. The dual frame 
is also stable: 
Equation: 


WEV. Bo PPS) Mie) =2 ls 


Mey 


The frame bounds A and B are redundancy factors. If the vectors {Yp} pep 


are normalized and linearly independent, then A < 1 < B. Sucha 
dictionary is called a Riesz basis of V and the dual frame is biorthogonal: 
Equation: 


V(p, pl) € A*, (Pp, Gp) = 6[p — pl]. 


When the basis is orthonormal, then both bases are equal. Analysis and 
synthesis problems are then identical. 


The frame theory is also used to construct redundant dictionaries that define 
complete, stable, and redundant signal representations, where V is then the 
whole signal space. The frame bounds measure the redundancy of such 
dictionaries. Chapter 5 studies the construction of windowed Fourier and 
wavelet frame dictionaries by sampling their time, frequency, and scaling 
parameters, while controlling frame bounds. In two dimensions, directional 
wavelet frames include wavelets sensitive to directional image structures 
such as textures or edges. 


To improve the sparsity of images having edges along regular geometric 
curves, Candés and Donoho (CandesD:99) introduced curvelet frames, with 
elongated waveforms having different directions, positions, and scales. 
Images with piecewise regular edges have representations that are 
asymptotically more sparse by thresholding curvelet coefficients than 
wavelet coefficients. 


Ideal Dictionary Approximations 


In a redundant dictionary 9 = {yp} ,,» we would like to find the best 


approximation support A with M = |A| vectors, which minimize the error 
|| ff) ||?. Chapter 12 proves that it is equivalent to find Az, which 
minimizes the corresponding approximation Lagrangian 

Equation: 


LZ (T, f,) =|] f-fa ll? +7? [Al 


for some multiplier T. 


Compression and denoising are two applications of redundant dictionary 
approximations. When compressing signals by quantizing dictionary 
coefficients, the distortion rate varies, like the Lagrangian [link], with a 
multiplier T that depends on the quantization step. Optimizing the coder is 
thus equivalent to minimizing this approximation Lagrangian. For sparse 
representations, most of the bits are devoted to coding the geometry of the 
Sparse approximation set A; in y. 


Estimators reducing noise from observations X = f + W are also 
optimized by finding a best orthogonal projector over a set of dictionary 
vectors. The model selection theory of Barron, Birgé, and Massart (massart- 
birge-barron) proves that finding Ar, which minimizes this same 
Lagrangian % (JT, X, A), defines an estimator that has a risk on the same 
order as the minimum approximation error || f—f,, ||? up to a logarithmic 
factor. This is similar to the optimality result obtained for thresholding 
estimators in an orthonormal basis. 


The bad news is that minimizing the approximation Lagrangian Lo is an 
NP-hard problem and is therefore computationally intractable. It is 
necessary therefore to find algorithms that are sufficiently fast to compute 
suboptimal, but “good enough,” solutions. 


Dictionaries of Orthonormal Bases 


To reduce the complexity of optimal approximations, the search can be 
reduced to subfamilies of orthogonal dictionary vectors. In a dictionary of 
orthonormal bases, any family of orthogonal dictionary vectors can be 
complemented to form an orthogonal basis & included in Y. As a result, 
the best approximation of f from orthogonal vectors in Z& is obtained by 
thresholding the coefficients of f in a “best basis” in J. 


For tree dictionaries of orthonormal bases obtained by a recursive split of 
orthogonal vector spaces, the fast, dynamic programming algorithm of 
Coifman and Wickerhauser (CoifmanMW:92) finds such a best basis with 
O(P) operations, where P is the dictionary size. 


Wavelet packet and local cosine bases are examples of tree dictionaries of 
time-frequency orthonormal bases of size P = N log, N.A best basis is a 
time-frequency tiling that is the best match to the signal time-frequency 
structures. 


To approximate geometrically regular edges, wavelets are not as efficient as 
curvelets, but wavelets provide more sparse representations of singularities 
that are not distributed along geometrically regular curves. Bandlet 
dictionaries, introduced by Le Pennec, Mallat, and Peyré (bandelets-siam, 
bandlets-peyre), are dictionaries of orthonormal bases that can adapt to the 
variability of images' geometric regularity. Minimax optimal asymptotic 
rates are derived for compression and denoising. 


Pursuit in Dictionaries 


Approximating signals only from orthogonal vectors brings rigidity that 
limits the ability to optimize the representation. Pursuit algorithms remove 


this constraint with flexible procedures that search for sparse, although not 
necessarily optimal, dictionary approximations. Such approximations are 


computed by optimizing the choice of dictionary vectors {pp te ae 


Matching Pursuit 


Matching pursuit algorithms introduced by Mallat and Zhang (MallatZ:93) 
are greedy algorithms that optimize approximations by selecting dictionary 
vectors one by one. The vector in y,, € F that best approximates a signal 


f is 
(r *») 


Equation: 


@,, = argmax 
pel 


and the residual approximation error is 
Equation: 


kf =f- (f; Poo) Poo 


A matching pursuit further approximates the residue Rf by selecting 
another best vector yp, from the dictionary and continues this process over 
next-order residues R™ f, which produces a signal decomposition: 
Equation: 


M-1 


f=, (Re f, ems) Sp, + R™ f. 


m=0 


The approximation from the M-selected vectors {p,, }o<mem Can be 
refined with an orthogonal back projection on the space generated by these 
vectors. An orthogonal matching pursuit further improves this 
decomposition by orthogonalizing progressively the projection directions 


(py, during the decompositon. The resulting decompositions are applied to 
compression, denoising, and pattern recognition of various types of signals, 
images, and videos. 


Basis Pursuit 


Approximating f with a minimum number of nonzero coefficients a| p] in a 
dictionary J is equivalent to minimizing the 1° norm || a ||), which gives 
the number of nonzero coefficients. This 1° norm is highly nonconvex, 
which explains why the resulting minimization is NP-hard. Donoho and 
Chen (DonohoC:95) thus proposed replacing the 1° norm by the 11 norm 

| @ I], = dopey 1a [p]|, which is convex. The resulting basis pursuit 
algorithm computes a synthesis operator 

Equation: 


= Sa[p] Yp, Which minimizes || a ||, = S° |a [ pl]. 
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This optimal solution is calculated with a linear programming algorithm. A 
basis pursuit is computationally more intense than a matching pursuit, but it 
is amore global optimization that yields representations that can be 
moresparse. 


In approximation, compression, or denoising applications, f is recovered 
with an error bounded by a precision parameter €. The optimization [link] 
is thus relaxed by finding a synthesis such that 

Equation: 


| fF Sla[plypll< e, which minimizes || a ||, = S> |a[z]l. 


pey pey 


This is a convex minimization problem, with a solution calculated by 
minimizing the corresponding 1! Lagrangian 
Equation: 


A(T, f,a) =|| f- >) ale) gp |? +7 ll Mla, 


pcy 


where T is a Lagrange multiplier that depends on €. This is called an 1! 
Lagrangian pursuit in this book. A solution G | p] is computed with iterative 
algorithms that are guaranteed to converge. The number of nonzero 
coordinates of @ typically decrea-ses as T increases. 


Incoherence for Support Recovery 


Matching pursuit and 1! Lagrangian pursuits are optimal if they recover the 
approx-imation support A7, which minimizes the approximation Lagrangian 
Equation: 


Lo (T, ue x) =|| Seas I)? ae AI, 


where fy is the orthogonal projection of f in the space V, generated by 
{Pp} <a: this is not always true and depends on Az. An Exact Recovery 
Criteria proved byTropp (tropp-multi-omp) guarantees that pursuit 
algorithms do recover the optimal support, if 

Equation: 


ERC (rr) =max Y* |(Bp»%0)| <1 
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where {%p} <4, is the biorthogonal basis of {yp}... 4, in Vay. This 


criterion implies that dictionary vectors @g outside Ay should have a small 
inner product with vectors in Ar. 


This recovery is stable relative to noise perturbations if {y,} pea has Riesz 
bounds that are not too far from 1. These vectors should be nearly 

orthogonal and hence have small inner products. These small inner-product 
conditions are interpreted as a form of incoherence. A stable recovery of Ar 


is possible if vectors in Ay are incoherent with respect to other dictionary 
vectors and are incoherent between themselves. It depends on the geometric 
configuration of A; in y. 


Inverse Problems 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
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Most digital measurement devices, such as cameras, microphones, or 
medical imaging systems, can be modeled as a linear transformation of an 
incoming analog signal, plus noise due to intrinsic measurement 
fluctuations or to electronic noises. This linear transformation can be 
decomposed into a stable analog-to-digital linear conversion followed by a 
discrete operator U that carries the specific transfer function of the 
measurement device. The resulting measured data can bewritten 
Equation: 


where is the high-resolution signal we want to recover, and 

is the measurement noise. For a camera with an optic that is out of focus, 
the operator U is a low-pass convolution producing a blur. For a magnetic 
resonance imaging system, U is a Radon transform integrating the signal 
along rays and the number Q of measurements is smaller than N. In such 
problems, U is not invertible and recovering an estimate of _ is an ill-posed 
inverse problem. 


Inverse problems are among the most difficult signal-processing problems 
with considerable applications. When data acquisition is difficult, costly, or 
dangerous, or when the signal is degraded, super-resolution is important to 
recover the highest possible resolution information. This applies to satellite 
observations, seismic exploration, medical imaging, radar, camera phones, 
or degraded Internet videos displayed on high-resolution screens. 
Separating mixed information sources from fewer measurements is yet 
another super-resolution problem in telecommunication or audio 
recognition. 


Incoherence, sparsity, and geometry play a crucial role in the solution of ill- 
defined inverse problems. With a sensing matrix U with random 
coefficients, Candés and Tao (candes-near-optimal) and Donoho (donoho- 
cs) proved that super-resolution becomes stable for signals having a 
sufficiently sparse representation in a dictionary. This remarkable result 
opens the door to new compression sensing devices and algorithms that 
recover high-resolution signals from a few randomized linear 
measurements. 


Diagonal Inverse Estimation 


In an ill-posed inverse problem, 
Equation: 


the image space of U is of dimension Q 
smaller than the high-resolution space N where _ belongs. Inverse 
problems include two difficulties. In the image space ImU, where U is 
invertible, its inverse may amplify the noise W, which then needs to be 
reduced by an efficient denoising procedure. In the null space NullU, all 
signals h are set to zero and thus disappear in the measured data Y. 
Recovering the projection of in NullU requires using some strong prior 
information. A super-resolution estimator recovers an estimation of ina 
dimension space larger than Q and hopefully equal to N, but this is not 
alwayspossible. 


Singular Value Decompositions 


Let be the representation of in an orthonormal 
basis . An approximation must be recovered from 
Equation: 


A basis _ of singular vectors diagonalizes . Then U transforms a 
subset of Q vectors of into an orthogonal basis 


of ImU and sets all other vectors to zero. A singular value decomposition 
estimates the coefficients of by projecting Y on this singular basis 
and by renormalizing the resultingcoefficients 

Equation: 


where hm are regularization parameters. 


Such estimators recover nonzero coefficients in a space of dimension Q and 
thus bring no super-resolution. If U is a convolution operator, then __ is the 
Fourier basis and a singular value estimation implements a regularized 
inverseconvolution. 


Diagonal Thresholding Estimation 


The basis that diagonalizes rarely provides a sparse signal 
representation. For example, a Fourier basis that diagonalizes convolution 
operators does not efficiently approximate signals including singularities. 


Donoho (Donoho:95) introduced more flexibility by looking for a basis 
providing a sparse signal representation, where a subset of Q vectors 
are transformed by U in a Riesz basis of ImU, 


while the others are set to zero. With an appropriate renormalization, 
has a biorthogonal basis that is normalized 


. The sparse coefficients of in can then be estimated with a 
thresholding 


Equation: 


for thresholds T,, appropriately defined. 


For classes of signals that are sparse in __, such thresholding estimators 
may yield a nearly minimax risk, but they provide no super-resolution since 
this nonlinear projector remains in a space of dimension Q. This result 
applies to classes of convolution operators U in wavelet or wavelet packet 
bases. Diagonal inverse estimators are computationally efficient and 
potentially optimal in cases where super-resolution is not possible. 


Super-resolution and Compressive Sensing 


Suppose that _ has a sparse representation in some dictionary 
of P normalized vectors. The P vectors of the transformed 

dictionary belong to the space ImU of dimension 

and thus define a redundant dictionary. Vectors in the 
approximation support A of are not restricted a priori to a particular 
subspace of _ . Super-resolution is possible if the approximation support A 
of in canbe estimated by decomposing the noisy data Y over Dy. It 
depends on the properties of the approximation support A of iny. 


Geometric Conditions for Super-resolution 


Let be the approximation error of a sparse representation 
of .The observed signal can be written as 


Equation: 


If the support A can be identified by finding a sparse approximation of Y in 
Du 
Equation: 


then we can recover a super-resolution estimation of 
Equation: 


This shows that super-resolution is possible if the approximation support A 
can be identified by decomposing Y in the redundant transformed dictionary 
Dy. If the exact recovery criteria is satisfy and if 


is a Riesz basis, then A can be recovered using pursuit algorithms with 
controlled error bounds. 


For most operator U, not all sparse approximation sets can be recovered. It 
is necessary to impose some further geometric conditions on A in y, which 
makes super-resolution difficult and often unstable. Numerical applications 
to sparse spike deconvolution, tomography, super-resolution zooming, and 
inpainting illustrate these results. 


Compressive Sensing with Randomness 


Candés and Tao (candes-near-optimal), and Donoho (donoho-cs) proved 

that stable super-resolution is possible for any sufficiently sparse signal _ if 

U is an operator with random coefficients. Compressive sensing then 

becomes possible by recovering a close approximation of from 
linear measurements (candes-cs-review). 


A recovery is stable for a sparse approximation set only if the 
corresponding dictionary family is a Riesz basis of the space it 
generates. The M-restricted isometry conditions of Candés, Tao, and 
Donoho (donoho-cs) imposes uniform Riesz bounds for all sets with 


Equation: 


This is a strong incoherence condition on the P vectors of : 


which supposes that any subset of less than M vectors is nearly uniformly 
distributed on the unit sphere of ImU. 


For an orthogonal basis , this is possible for 


if U is a matrix with independent Gaussian random 
coefficients. A pursuit algorithm then provides a stable approximation of 
any having a sparse approximation from vectors in 


These results open a new compressive-sensing approach to signal 
acquisition and representation. Instead of first discretizing linearly the 
signal at a high-resolution N and then computing a nonlinear representation 
over M coefficients in some dictionary, compressive-sensing Measures 
directly M randomized linear coefficients. A reconstructed signal is then 
recovered by a nonlinear algorithm, producing an error that can be of the 
same order of magnitude as the error obtained by the more classic two-step 
approximation process, with a more economic acquisiton process. These 
results remain valid for several types of random matrices U. Examples of 
applications to single-pixel cameras, video super-resolution, new analog-to- 
digital converters, and MRI imaging are described. 


Blind Source Separation 


Sparsity in redundant dictionaries also provides efficient strategies to 
separate a family of signals that are linearly mixed in 


observed signals with noise: 
Equation: 


From a stereo recording, separating the sounds of S musical instruments is 
an example of source separation with . Most often the mixing matrix 
is unknown. Source separation is a super- 
resolution problem since data values must be recovered from 
measurements. Not knowing the operator U makes it 
even more complicated. 


If each source has a sparse approximation support A; ina dictionary , 
with , then it is likely that the sets are nearly 
disjoint. In this case, the operator U, the supports A,, andthe sources _—are 
approximated by computing sparse approximations of the observed data Y;, 
in . The distribution of these coefficients identifies the coefficients of the 
mixing matrix U and the nearly disjoint source supports. Time-frequency 
separation of sounds illustrate these results. 


Travel Guide 

This collection comprises Chapter 1 of the book A Wavelet Tour of Signal 
Processing, The Sparse Way (third edition, 2009) by Stéphane Mallat. The 
book's website at Academic Press is 
http://www.elsevier.com/wps/find/bookdescription.cws_home/714561/descr 
iption#description The book's complementary materials are available at 
http://wavelet-tour.com 


Travel Guide 


Reproducible Computational Science 


This book covers the whole spectrum from theorems on functions of 
continuous variables to fast discrete algorithms and their applications. [link] 
argues that models based on continuous time functions give useful 
asymptotic results for understanding the behavior of discrete algorithms. 
Still, a mathematical analysis alone is often unable to fully predict the 
behavior and suitability of algorithms for specific signals. Experiments are 
necessary and such experiments should be reproducible, just like 
experiments in other fields of science (DonohoB:95). 


The reproducibility of experiments requires having complete software and 
full source code for inspection, modification, and application under varied 
parameter settings. Following this perspective, computational algorithms 
presented in this book are available as MATLAB subroutines or in other 
software packages. Figures can be reproduced and the source code is 
available. Software demonstrations and selected exercise solutions are 
available at http://wavelet-tour.com. For the instructor, solutions are 
available at www.elsevierdirect.com/9780123743701. 


Book Road Map 


Some redundancy is introduced between sections to avoid imposing a linear 
progression through the book. The preface describes several possible 
programs for a sparse signal-processing course. 


All theorems are explained in the text and reading the proofs is not 
necessary to understand the results. Most of the book's theorems are proved 
in detail, and important techniques are included. Exercises at the end of 
each chapter give examples of mathematical, algorithmic, and numeric 
applications, ordered by level of difficulty from 1 to 4, and selected 
solutions can be found at http://wavelet-tour.com. 


The book begins with Chapters 2 and 3, which review the Fourier transform 
and linear discrete signal processing. They provide the necessary 
background for readers with no signal-processing background. Important 
properties of linear operators, projectors, and vector spaces can be found in 
the Appendix. Local time-frequency transforms and dictionaries are 
presented in Chapter 4; the wavelet and windowed Fourier transforms are 
introduced and compared. The measurement of instantaneous frequencies 
illustrates the limitations of time-frequency resolution. Dictionary stability 
and redundancy are introduced in Chapter 5 through the frame theory, with 
examples of windowed Fourier, wavelet, and curvelet frames. Chapter 6 
explains the relationship between wavelet coefficient amplitude and local 
signal regularity. It is applied to the detection of singularities and edges and 
to the analysis of multifractals. 


Wavelet bases and fast filter bank algorithms are important tools presented 
in Chapter 7. An overdose of orthonormal bases can strike the reader while 
studying the construction and properties of wavelet packets and local cosine 
bases in Chapter 8. It is thus important to read Chapter 9, which describes 
Sparse approximations in bases. Signal-compression and denoising 
applications described in Chapters 10 and 11 give life to most theoretical 
and algorithmic results in the book. These chapters offer a practical 
perspective on the relevance of linear and nonlinear signal-processing 
algorithms. Chapter 12 introduces sparse decompositions in redundant 
dictionaries and their applications. The resolution of inverse problems is 
studied in Chapter 13, with super-resolution, compressive sensing, and 
source separation. 


